Abstract. Context: Software Engineering research makes use
of collections of software artifacts (corpora) to derive empirical
evidence from. Goal: To improve the quality and reproducibility
of research, we need to understand the characteristics of the
corpora used. Method: We perform a literature survey using
grounded theory and analyze the latest proceedings of seven
relevant conferences. Results: While almost all papers use corpora
of some kind, most commonly collections of source code of
open-source Java projects, no projects or corpora are used
frequently across all the papers. For some conferences we can
detect recurrences. We discover several forms of requirements
and applied tunings for corpora, which indicate more specific
needs of research efforts. Conclusion: Our survey feeds into a
quantitative basis for discussing the current state of empirical
research in software engineering, thereby ultimately enabling
improvement of research quality, specifically in terms of use (and
reuse) of empirical evidence.

I. Introduction 

This is a survey on software engineering research with 
focus on the use of collections of software artifacts (corpora) 
to derive empirical evidence from. Such focus on corpora 
was triggered by our own research on specifically software 
reverse/re-engineering and program comprehension, e.g., studies
on API or language usage [1], [2], [3], [4], with the
common use of corpora for validation in the broader sense.
The survey applies to conferences that fit with this context. 
One can observe a diversity of involved methodologies and 
characteristics of the collections of empirical evidence as they 
are leveraged in SE research. Thus, we embarked on the
present literature survey with the following central research 
questions: 

I   How often do Software Engineering papers use corpora
    (collections of empirical evidence)?
II  What are the nature and characteristics of the used corpora?
III Does common content occur in the used corpora?

For this, we collected and analyzed the latest proceedings (as
they are available from the DBLP bibliography service,
http://dblp.uni-trier.de/) of the following conferences:
European Conference on Software Maintenance and Reengineering
(CSMR), International Symposium on Empirical Software
Engineering and Measurement (ESEM), International Conference
on Program Comprehension (ICPC), International Conference on
Software Maintenance (ICSM), Working Conference on Mining
Software Repositories (MSR), Working Conference on Source Code
Analysis and Manipulation (SCAM), Working Conference on
Reverse Engineering (WCRE). We choose these conferences
because i) they cover software engineering topics that, based 
on our experience, we expect to make use of empirical 
evidence; ii) they cover ground related to our expertise and 
research focus on software reverse/re-engineering and program 
comprehension with ESEM as notable addition for broader 
coverage of empirical software engineering research; iii) the 
conferences are of comparable size. In our survey, we use 
only long papers. We choose to analyze only conference 
proceedings, because while journal articles may adhere to the 
best practices, conference proceedings arguably contain the 
most common practices of research in the community — and 
we are interested in the latter.

SE research has been surveyed before; see Table I for
a summary. The cited surveys focus on specific forms or
characteristics of SE research to be analyzed with a predefined
schema. For instance, Kitchenham et al. surveyed SE journals
and conferences to find out the adoption rate of systematic
literature reviews [8]. Similarly, Sjøberg et al. sought to find and
analyze existing controlled experiments in SE research [6].
By contrast, we (first, to our knowledge) seek to discover
whatever empirical evidence is used to facilitate SE research,
and we allow our coding schema to emerge from the data.
We follow the idea of Grounded Theory (GT) as understood
by Glaser [9] (on the difference between the Straussian and
Glaserian versions, see [10]).

The paper is organized as follows: §II describes the
methodology underlying this literature survey. §III presents
the results of the survey. §IV discusses related surveys. §V
identifies threats to validity. §VI concludes the paper.

II. Methodology 

Empirical research is usually perceived as taking one of 
the forms: controlled and quasi-experiments, exploratory and 
confirmatory case studies, survey, ethnography, and action 
research [11], [12]. In a broader sense, empirical research also
includes any research based on collected evidence; quoting
from [11]: "Empirical research seeks to explore, describe,
predict, and explain natural, social, or cognitive phenomena by 
using evidence based on observation or experience. It involves 
obtaining and interpreting evidence by, e.g., experimentation, 
systematic observation, interviews or surveys, or by the careful 
examination of documents or artifacts [emphasis added]." 

Since Software Engineering is a practical area of Computer
Science, it is logical to expect that most SE research is
evidence-based, i.e., empirical de facto, and in the present
study we set out to substantiate this expectation. We
believe that a bottom-up approach of observing what exists
and discovering methodology as well as definitions of forms
of research complements the prominent top-down approach,
in which a methodology is derived from theoretical considerations
or by borrowing from other sciences (medicine, sociology,
psychology).

TABLE I. Literature surveys on Software Engineering research

Name               Ref   Year   # used               # papers
                                j    c               total  sel.  rel.
Glass et al.       [5]   2002            1995-1999   -      369   369    Characteristics of SE research
Sjøberg et al.     [6]   2005            1993-2002   5453   103   103    Controlled experiments
Zannier et al.     [7]   2006            1975-2005   1227   63    44     Empirical evaluation: quantity and soundness
Kitchenham et al.  [8]   2009   10   3   2004-2007   2506   33    19     Systematic reviews
Our study                2013            2011/2012   227    175   175    Empirical evidence

Coding schema:
Glass et al.:      Topics, research approaches and methods,
                   theoretical basis, level of analysis
Sjøberg et al.:    Extent, topic, subjects, task and environment,
                   replication, internal and external validity
Zannier et al.:    Study type, sampling type, target and used
                   population, evaluation type, proper use of
                   analysis, usage of hypotheses
Kitchenham et al.: Inclusion and exclusion criteria, coverage,
                   quality/validity assessment, description of the
                   basic data
Our study:         Emerged classification

Legend: j and c stand for journals and conferences; sel. and rel. stand for selected and relevant.

TABLE II. Conferences used in the survey

Year   Conference   # papers
                    total   long
2012   CSMR         30      30
2012   ESEM         43      24
2012   ICPC         23      21
2011   ICSM         36      36
2012   MSR          29      18
2011   SCAM         19      19
2011   WCRE         47      27
       Total        227     175

This survey is particularly concerned with (collections of) 
empirical evidence. Thus, the following questions guide the 
research: 

I   How often do Software Engineering papers use corpora
    (collections of empirical evidence)?
II  What are the nature and characteristics of the used corpora?
III Does common content occur in the used corpora?

For that, we collected the papers from the latest edition 
of seven SE conferences: CSMR, ESEM, ICPC, ICSM, MSR, 
SCAM, and WCRE (see Table II for details). We used DBLP
pages of conferences to identify long papers and downloaded 
them from digital libraries. 

We then proceeded to read the papers to perform coding.
From a previously done, smaller and more specific literature
survey [13] and a pilot study for the present survey, we
had some basic understanding of the parts of the scheme to
emerge. During the first pass of coding, we started with an
empty scheme and extended it step by step to arrive at the
current scheme, as described below. During the second pass,
we compared the profiles of the coded papers against the latest
version of the scheme, went through the papers again, and
filled in the missing details.



While we were interested primarily in characteristics of 
used empirical evidence (specifically corpora), we also ex- 
tracted additional information about research reported in the 
papers: used tools, signs of rigorousness/quality, etc. We put 
the collected information in several groups: 

1) Corpora: We captured what was used as study objects 
(e.g., projects), what are their characteristics (e.g., language, 
open- vs. closed-source, code form), what are the requirements 
to the study objects, do they come from a specific source (e.g., 
established dataset or online repository), were they observed 
over a time (e.g., versions or revisions), what is the nature of 
preparation of the corpus. 

2) Forms of empirical research: During coding, several 
structural forms evolved that we used for capturing information 
conveyed in papers: experiments, questionnaires, literature 
surveys, and comparisons. Some relationships between forms 
and corpora usage also emerged. 

3) Self-classification: For each paper we captured what 
words authors use to describe their effort: e.g., case study, 
experiment. 

4) Tools: We collected mentions of existing tools (e.g.,
Eclipse, R, Weka) that were used as well as of introduced 
tools that were presented in the papers. (In many cases, these 
tools are used to analyze or to otherwise process corpora.) 

5) Structural signs of rigorousness/quality: We paid attention
to the following aspects of the study presentation: Do
authors use research questions? Null hypotheses? Is there a 
section on definitions and terms? Is validation mentioned? Is 
there a "Threats to validity" section? Are threats addressed in 
any structured way? 

6) Reproducibility: We tried to understand in each case, 
if a study can be reproduced. (Obviously, the use of corpora 
affects the definition of reproducibility.) We paid attention to 
the following signs: Are all details provided for a possible 
study replication (i.e., versions of used projects, time periods, 
etc.)? Do authors provide any material used in the paper, e.g., 
on a supplementary website? Altogether, would it be possible 
to reproduce the study? 

7) Assessment: Finally, we characterized the process of 
coding: how easy it was to extract information and how 
confident we are in the result. 
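
The seven groups above can be pictured as one record per coded paper. The following Python sketch only illustrates such a record; all class and field names (CorpusProfile, PaperProfile, and so on) are hypothetical and are not taken from the scripts used in the survey.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CorpusProfile:
    """Group 1: one corpus found in a paper (hypothetical field names)."""
    projects: List[str] = field(default_factory=list)
    languages: List[str] = field(default_factory=list)
    code_form: Optional[str] = None      # e.g. "source" or "binary"
    access: Optional[str] = None         # e.g. "open-source"
    source: Optional[str] = None         # e.g. an established dataset
    requirements: List[str] = field(default_factory=list)
    uses_evolution: bool = False         # versions, revisions, ...


@dataclass
class PaperProfile:
    """Coding result for one paper, following groups 1-7 above."""
    title: str
    conference: str
    corpora: List[CorpusProfile] = field(default_factory=list)    # 1
    forms: List[str] = field(default_factory=list)                # 2
    self_classification: List[str] = field(default_factory=list)  # 3
    tools: List[str] = field(default_factory=list)                # 4
    rigor_signs: List[str] = field(default_factory=list)          # 5
    reproducible: Optional[bool] = None                           # 6
    coding_confidence: Optional[str] = None                       # 7

    def uses_corpus(self) -> bool:
        """True if at least one corpus was detected in the paper."""
        return len(self.corpora) > 0


# Made-up example paper, for illustration only.
example = PaperProfile(
    title="An illustrative paper",
    conference="ICPC",
    corpora=[CorpusProfile(projects=["JHotDraw"], languages=["Java"],
                           code_form="source", access="open-source")],
    self_classification=["case study"],
)
print(example.uses_corpus())
```

Keeping the corpus as a separate record reflects the observation, reported in the results, that a single paper may contain more than one corpus.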



We did the pilot survey in September-October 2012. After
that, we adjusted our methodology (e.g., instead of filtering
papers based on their abstract, we decided to survey all
the papers) and proceeded to perform the current study in
November 2012-January 2013. We used Python and Bash
scripts, the Google Refine tool, and the R project to process
the data. We provide the list of the papers and the results of
coding online.

III. Results 

In this section, we present the results of our study. We
group them similarly to the description provided in Section II:
details about detected corpora, emerged forms of empirical
research, used or introduced tools, signs of rigorousness/quality
of research, reproducibility of the studies, and, finally,
assessment of our effort. When we use the phrase "on the average",
we mean the median of the appropriate distribution.



Next to the numbers, we provide framed highlights.

We use the formula "X out of Y papers" to give a feeling
for the numbers. E.g., "one out of three papers" means that
in every three surveyed papers there is one that has the
discussed characteristic.

We also provide the conference-wise percentage of found
characteristics. The table below illustrates the format on
an artificial example: conferences are listed from left to
right as the percentage increases. The percentage is always
given relative to the total number of long papers in
the conference. Where appropriate, the names of the most
popular projects, requirements, or tunings within the
conference appear below the percentage. When more than one
name is given, each of them appears with the specified frequency.

Artificial example

CSMR   ESEM   ICPC   ICSM   MSR    SCAM   WCRE
1%     2%     3%     4%     5%     6%     7%
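
As a rough illustration of the "X out of Y papers" phrasing, the following Python sketch turns a count into such a phrase by rounding the ratio to the nearest whole number. The function name `one_out_of` and the rounding rule are assumptions made for this example; the counts are taken from the results reported in this section.

```python
def one_out_of(count: int, total: int) -> str:
    """Phrase `count` out of `total` papers as '1 out of N papers'.

    N is the ratio total/count rounded to the nearest integer; the
    rounding rule is an assumption of this sketch.
    """
    if count <= 0 or total < count:
        raise ValueError("need 0 < count <= total")
    return f"1 out of {round(total / count)} papers"


# 28 of the 175 surveyed papers contain more than one corpus.
print(one_out_of(28, 175))
# 5 of the 175 surveyed papers contain a literature survey.
print(one_out_of(5, 175))
```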



A. Corpora 

1) Usage: We marked a paper as containing a corpus when 
the paper mentioned a collection of software artifacts used for 
deriving empirical evidence. Altogether, we have found 198 
corpora used in 165 papers out of 175 surveyed papers. 

In 28 cases, we decided that a paper contains more than 
one corpus. We did so consistently, when we met at least two 
of the following motivations mentioned in the paper when 
describing the purpose of collected empirical evidence: for 
benchmark or oracle (6 corpora), for training (6 corpora), 
for evaluation (5 corpora), for investigation (5 corpora), for 
testing (4 corpora), for investigating quality like accuracy or 
scalability (4 corpora). 

We have found that 168 corpora (used in 145 papers)
consist of projects (systems, software); in other cases, corpora
consist of another kind of study object: image, trace, feature,
web log, etc. Till the end of the current subsection (§III-A), we
restrict ourselves to the corpora consisting of projects and call
them project-based corpora.

Google Refine: http://code.google.com/p/google-refine/
R project: http://www.r-project.org/



Almost all papers use a corpus of some sort. One
out of six papers has more than one corpus. Most of the
corpora consist of projects.

Project-based corpora usage

ESEM   ICPC   WCRE   SCAM   CSMR   MSR    ICSM
58%    81%    81%    84%    87%    89%    94%



2) Contents: We identified the following common charac- 
teristics of project-based corpora. 

Size. Half of the corpora, 99 cases, have three or fewer
projects (of them, 45 corpora consist of only one project).
There are 24 corpora that contain more than 10 projects.
We detected large corpora (with more than 100 projects)
in 8 papers, one of them introducing an established dataset
itself.

Languages. Most of the corpora are monolingual (147 
cases); most of the remaining ones are bilingual (19 cases). 
As for the software language, 106 corpora contain projects
written in Java, while C-like languages are used in 50 corpora
(in C-like languages we include C, C++, C#).

Code form. In 125 cases, corpora consist of source code; 
in 15 cases — of binaries. In the rest of the cases, code of the 
projects is not used, something else is in focus (developers, 
requirements, etc.) 

Access. In 128 cases, corpora consist only of open-source 
projects; in 12 cases, corpora consist only of projects not 
available publicly (e.g., industrial software); in 9 cases, corpora 
are self-written. The remaining cases mix access forms. 

Projects. We collected the names of the used projects as they
are provided by the papers (modulo merging of names like
Vuze/Azureus; the project changed its name in 2008).

Table III lists projects frequently used in the corpora.
Eclipse is a complex project, and some corpora make use of
its sub-parts, considering them as projects on their own (e.g.,
JDT Search, PDE Build); counting such cases, there are
altogether 22 papers making use of Eclipse.

TABLE III. Used projects

Project        # corpora
JHotDraw       15
JEdit          12
Ant            11
ArgoUML        11
Eclipse        11
Firefox        10
Vuze/Azureus   8
Linux kernel   6
Lucene         6
Mozilla        6
Hibernate      5

Units. We captured when some unit related to the project
was in the focus of the study: a bug report or a UML class
diagram. Namely, we would capture the fact when such a unit
was used to give quantitative information (e.g., in a table
presenting the number of bug reports in the project under
investigation). The most popular units turned out to be bug
reports, which are used in 21 corpora; defects (faults, failures)
are used in 16 corpora; tests in 10; traces in 5.

TABLE IV. Online repositories and established datasets

Repository       URL
SourceForge      http://sourceforge.net/
Apache.org       http://projects.apache.org/
GitHub           https://github.com/
Android Market   now known as Google Play, https://play.google.com/store
CodePlex         http://www.codeplex.com/

Ref    Dataset         # papers
[14]   SIR             3
[15]   MSR challenge   2
[16]   P-MARt          2
[17]   PROMISE         2
[18]   Qualitas        2

Survey data (list of the papers and results of coding):
http://softlang.uni-koblenz.de/empsurvey



An average project-based corpus consists of the source
code of three open-source projects written in Java. Eclipse
or its sub-parts are used in one out of eight papers using
project-based corpora. The projects used in at least five
papers are JHotDraw, JEdit, Ant, ArgoUML, Firefox,
Vuze/Azureus, Linux kernel, Lucene, Mozilla, and Hibernate.
Within the corpus, bug reports, defects, tests, and
traces can be in the focus of the study.

Popular projects

WCRE      SCAM    ESEM      ICSM      CSMR      ICPC      MSR
11%       11%     13%       14%       23%       24%       28%
Eclipse   Lynx    Eclipse   ArgoUML   Eclipse   JEdit     Firefox
JEdit     Minix                                 Eclipse



Sources. When papers clearly state the source of their
corpora, we collected such information.

Online repositories used in more than one paper are listed
in Table IV. The rest of the detected online repositories are
used in only one paper each: BlackBerry App World, Google
Code, Launchpad, ShareJar.

Established datasets used in more than one paper are listed
in Table IV. Some of the other datasets that are used in only
one paper each: Bug prediction dataset [19], CHICKEN Scheme
benchmarks, CoCoME, DaCapo [20], FLOSSMetrics, iBUGS,
SMG2000 benchmark, SourcererDB [21], TEFSE challenge.
Table V summarizes the most popular types of sources and
their distribution across conferences.

One out of four project-based corpora uses an established
dataset, previous work, or online repository as a
source of the projects. There is no common frequently
used dataset or repository. Only SourceForge shows
moderately frequent usage.

Usage of corpora sources

ESEM   SCAM   ICPC   ICSM   WCRE   CSMR   MSR
13%    16%    24%    25%    26%    30%    39%

BlackBerry App World: http://appworld.blackberry.com/webstore
Google Code: http://code.google.com/
Launchpad: https://launchpad.net/
ShareJar: http://www.sharejar.com/
CHICKEN Scheme benchmarks: https://github.com/mario-goulart/chicken-benchmarks
CoCoME: http://agrausch.informatik.uni-kl.de/CoCoME
FLOSSMetrics: http://libresoft.es/research/projects/flossmetrics
iBUGS: http://www.st.cs.uni-saarland.de/ibugs/
SMG2000 benchmark: https://asc.llnl.gov/computing_resources/purple/archive/benchmarks/smg/
TEFSE challenge: http://www.cs.wm.edu/semeru/tefse2011/Challenge.htm



3) Evolution: We encountered 52 papers that use the evolution
of the projects in their research, meaning that they operate
on several versions, releases, etc. To describe the evolution
measure, the following terms were used: "version" (21 times),
"revision" (11), "commit" (10), "release" (11).

On the average, papers mentioning commits use 3,292 commits;
papers with revisions, 18,870 revisions; with versions,
10 versions; with releases, 10 releases.

There are 46 papers that mention a time span of their study.
In 36 cases, the unit of the time span is a year, and on the
average such papers are concerned with an 8-year span.

We found 23 papers that mention which version control
system was involved in the study. CVS is mentioned 11
times and SVN 11 times; Git and Mercurial are mentioned
4 and 2 times, respectively.



One out of three papers with project-based corpora
uses the evolution aspect in its research. In half of the
cases, large-scale evolution is involved: several thousand
commits/revisions or ten versions/releases of projects,
often spanning several years of a project's lifetime.

Evolution usage

ICPC   SCAM   ESEM   CSMR   ICSM   WCRE   MSR
14%    16%    21%    33%    33%    33%    56%



4) Requirements: We collected requirements for the corpora,
explicit as well as implicit. For instance, an implicit
requirement for a bug tracking system is inferred if the paper 
uses bug reports of the projects under investigation. The most 
popular direction of requirements is the presence of some 
'ecosystem' (found in 37 papers): existence of bug tracking 
systems, mailing lists, documentation (e.g., user manuals). 
Another popular requirement, found in 25 papers, has to do 
with the size of the projects: small, sufficient, large, or of 
particular size (as specific as "medium of the sizes of the ten 
most popular Sourceforge projects"), or the need of diversity 
of sizes. In 23 papers, it was stated that the used projects were 
chosen because they were used in previous work (of the same 
or other authors). Language-related requirement was present 
in 22 papers for a specific language or for the diversity of 
languages in a corpus. In 14 papers, the choice of projects 
was attributed to either diversity of application domains or 
to a specific domain. Some aspect of the used projects was 
mentioned as essential in 14 papers: active or wide-spread 
usage, popularity, well-known and established software. Other 
popular requirements include presence of development history 
(15 papers), dependencies (11 papers), or tests (10 papers). 



TABLE V. Sources of corpora (# papers)

Type                  Total   CSMR   ESEM   ICPC   ICSM   MSR   SCAM   WCRE
Established dataset   20      5      -      2      6      5     -      2
Previous work         13      2      3      2      1      1     2      2
Online repository     12      3      -      1      2      2     1      3
Total                 43      9      3      5      9      7     3      7
Percentage            25      30     13     24     25     39    16     26
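
The percentage row of Table V can be re-derived from its "Total" row and the numbers of long papers per conference in Table II. The Python sketch below does this; the variable names and the half-up rounding rule are assumptions of the sketch, not something stated in the survey.

```python
import math

# Long papers per conference (Table II).
long_papers = {"CSMR": 30, "ESEM": 24, "ICPC": 21, "ICSM": 36,
               "MSR": 18, "SCAM": 19, "WCRE": 27}

# Papers with a stated corpus source per conference ("Total" row, Table V).
papers_with_source = {"CSMR": 9, "ESEM": 3, "ICPC": 5, "ICSM": 9,
                      "MSR": 7, "SCAM": 3, "WCRE": 7}


def percentage(count: int, total: int) -> int:
    """Whole-number percentage, rounding halves up (assumed rule)."""
    return math.floor(100 * count / total + 0.5)


per_conference = {name: percentage(papers_with_source[name], long_papers[name])
                  for name in long_papers}
overall = percentage(sum(papers_with_source.values()),
                     sum(long_papers.values()))

print(per_conference)
print(overall)
```

Half-up rounding is used instead of Python's built-in `round`, which rounds exact halves to the nearest even number and would turn the 12.5 % of ESEM into 12 rather than 13.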



One out of five papers requires the projects of its
corpus to have an ecosystem: a bug tracker, a mailing
list, or some kind of documentation. Other requirements
focus on the size and language of the projects, application
domain, development history, etc.

Popular requirements

SCAM     MSR      ICSM   ICPC   ESEM     WCRE     CSMR
11%      28%      14%    24%    13%      11%      23%
domain   ecosys   size   size   ecosys   ecosys   ecosys



lang 



p. work 



5) Tuning: We captured what kind of action is applied to a 
corpus during research. In 20 papers, sources or binaries were 
modified by instrumentation, faults/clones injection, adjusting 
identifiers, etc. In 15 papers, tests needed to be run against the 
corpus either to verify made modifications or to collect the 
data. In 10 papers, the corpora had to be executed in order 
to perform the needed analysis or to collect data. In 6 papers, 
some filtering of the contents of the corpus was needed to, 
e.g., identify main source code/main part of the project. 



We have detected a few common actions applied to corpora
during research: source code/binaries modification;
execution of the tests on the corpus or of the corpus itself;
filtering of the corpus contents. Altogether, one out of four
papers contains signs of one of these actions.

Popular actions

ESEM    MSR   SCAM     ICPC     ICSM    WCRE     CSMR
8%      11%   11%      14%      19%     19%      20%
tests   run   modif.   modif.   tests   modif.   modif.



We captured manual effort that went into creation of a 
corpus, e.g., when a paper mentions setting up environments 
and providing needed libraries in order to execute the corpus. 
For that, we graded each corpus on the following scale: none,
no manual effort mentioned (120 corpora); some, some manual
effort mentioned, e.g., manual detection of design patterns in
source code (33 corpora); and all, the corpus is self-written
(10 corpora).



One out of four project-based corpora requires some
manual effort.

Manual effort

CSMR   ESEM   ICSM   SCAM   ICPC   MSR    WCRE
13%    13%    22%    26%    33%    33%    37%



B. Self-classification 



We collected explicit self-classifications from the papers;
from sentences like "we have conducted a case study"
we would conclude that the paper is a case study. Some
of the self-classifications were very detailed and precise, e.g.,
"a pre/post-test quasi experiment"; in such cases we reduced
the type to a simpler version, e.g., an experiment. We would
also count terms like "experimental assessment" or
"experimental study" towards the experiment type. As seen from
Table VI, authors most often use terms such as "case study"
and "experiment" to describe their research. In some cases,
papers contain more than one self-classification (24 cases).
In 36 papers, we could not detect any self-classification.

TABLE VI. Self-classification

Type                #
case study          48
experiment          44
empirical study     22
evaluation          14
exploratory study   6


Four out of five papers provide self-classification, but 
it might be vague. The most popular term, 'case study,' 
may be misused. Cf., "There is much confusion in the SE
literature over what constitutes a case study. The term is 
often used to mean a worked example. As an empirical 
method, a case study is something very different." fl^. 
Cf., "... our sample indicated a large misuse of the term 
case study." ||2l 

Self-classification

SCAM   MSR    CSMR   WCRE   ICPC   ESEM   ICSM
37%    61%    80%    81%    86%    88%    94%



C. Emerged forms 

Independently of the self-classification of the papers, we 
noted structural characteristics of research performed in the 
papers. We did not use any theoretical definition for what 
to consider a questionnaire or an experiment. The developed 
definitions are structural, composed of the characteristics that 



emerged from the papers, as they were discussed and struc- 
turally supported by the authors. 

1) Experiment: We have identified 22 experiments in 19 
papers. Except for two, they all involve human subjects. On 
the average, an experiment has 16 participants. The maximum 
number of participants is 128, the minimum is 2, first and 
third quartiles are 5 and 34 respectively. In 21 cases, an 
experiment uses a corpus (in 17 cases, a project-based one); 
20 questionnaires are used in 10 experiments. 
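
The summary statistics quoted above (median, quartiles, minimum, maximum) can be computed with Python's standard `statistics` module. The participant counts in the sketch below are made up for illustration and are chosen only so that the summary matches the reported values; the "inclusive" quartile method is likewise an assumption, since the text does not say how the quartiles were computed.

```python
import statistics

# Hypothetical participant counts per experiment (NOT the survey data).
participants = [2, 5, 16, 34, 128]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) of a list of numbers."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


summary = five_number_summary(participants)
print(summary)
# "On the average" in this paper means the median:
print(statistics.median(participants))
```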

In two-thirds of the experiments, participants come from
one population; the remaining experiments draw participants
from two or three populations. The most common source
of participants is students, sometimes distinguished by their
level: graduate, undergraduate, Bachelor, Master, and PhD
students. In one-third of the cases, professionals are involved
(full-time developers, experts, industry practitioners, etc.). In
half of the cases, participants form the only group in the
experiment. When there is more than one group (usually
two, with a couple of exceptions of 4 and 5 groups), each group
represents a treatment (a task), an experience level, or
a gender. On the average, an experiment has 4 tasks and lasts
for an hour (with a few exceptions when an experiment takes
several weeks or even a month).

In 6 cases, it is mentioned that an experiment had a pilot 
study. In 6 cases, it is mentioned that participants of the 
experiment were offered compensation: monetary or another 
kind of incentive (e.g., a box of candy). 

The main requirement for the participants is their experience:
basic knowledge of the used technology, language, or
IDE. As for the tasks, they are expected to be of a certain
size (e.g., a method body to fit on one page), or of certain 
contents (e.g., contain "if" statements). The usual requirement 
for an experiment also is either that the tested tool or used 
code is unfamiliar to the participants, or on the contrary that 
the background is familiar (e.g., well-known design patterns). 



One out of ten papers contains an experiment. The
majority of the experiments use project-based corpora;
experiments often use questionnaires, usually two per
experiment. An average experiment involves 16 students,
often in two groups (by the received treatment or experience
level); it consists of four tasks and lasts for an hour.
One out of four experiments suggests some compensation
to its participants; one out of four experiments is preceded
by a pilot study.

ICPC and ESEM are the main sources of experiments
involving professionals.

Experiments

MSR   SCAM   CSMR   WCRE   ICSM   ESEM   ICPC
0%    0%     3%     7%            21%    38%



2) Questionnaire: Altogether, we have found 36 question- 
naires in 24 papers. As mentioned, 20 questionnaires are 
used in experiments — to distinguish, we will refer to them 
as experiment-related and the other 16 we will qualify as 
experiment-unrelated. 



Sizewise, there is no particular difference between 
experiment-related and -unrelated questionnaires. On the aver- 
age, both have 20 questions grouped in one section. In 6 cases, 
an experiment-unrelated questionnaire has a corpus. 

While experiment-related questionnaires have the same 
participants as the experiments they relate to (i.e., involve 
mostly students), experiment-unrelated questionnaires involve 
professionals (testers, managers, experts, consultants, software 
engineers) as participants in two-thirds of the cases. On the 
average, an experiment-unrelated questionnaire has 12 partici- 
pants. When it was possible (6 cases), we calculated how many 
participants took part in the experiment-unrelated questionnaire 
compared to the initial number of questioned people. On the 
average, 19 % take part in the end, in the worst case the ratio 
can be as low as 5 %. 

While experiment-related questionnaires have the same 
requirements regarding the participants as the experiments they 
relate to, experiment-unrelated questionnaires have require- 
ments concerned with the participants' experience (e.g., Java 
experience) or expertise (specific area of experience such as 
clone detection or web development). 

When related to experiments, questionnaires are often 
performed before (referred to as "pretest" in 6 cases) and after 
the experiment (referred to as "posttest" in 9 cases). 

In 5 cases, an experiment-unrelated questionnaire was 
preceded by a pilot study. 



More than half of the detected questionnaires are
used in experiments, often as pretest and posttest
questionnaires. The other half, experiment-unrelated
questionnaires, are found in one out of twelve papers.
Sizewise, on the average there is no difference between
experiment-related and -unrelated questionnaires.
Experiment-unrelated questionnaires usually involve
professionals as participants, in contrast to experiment-related
questionnaires, which mostly use students. Typical
requirements for participants in experiment-unrelated
questionnaires have to do with experience or expertise.
One out of three experiment-unrelated questionnaires is
preceded by a pilot study.

Experiment-unrelated questionnaires

MSR   CSMR   SCAM   WCRE   ICSM   ICPC   ESEM
0%    3%     5%     7%     8%     19%    25%



3) Literature survey: We have found 6 literature surveys 
in 5 papers. Except for one, they provide extensive details 
on how the survey was conducted. In particular, the used 
methodology is clearly stated: four times it is said to be a 
"systematic literature review" and once a "quasi systematic 
literature review". In three cases, the systematic literature
review was done following the guidelines by Kitchenham [22].

The papers are initially collected either by searching digital 
libraries or from the proceedings of specific conferences and 
journals. Among used digital libraries are EI Compendex, 
Google Scholar, ISI, and Scopus — the latter was used in 
two papers. As for the conferences and journals, there is 



no intersection between the lists of names — except for ICSE, 
which was used in two papers. 

On the average, a literature survey starts with 2161 papers and
its final set contains 35 papers, meaning that on the average
only 1.6 % of the papers are taken into account in the end. The
percentage can be as high as 39 % and as low as %. 
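
The average retention figure can be checked with a few lines of Python. This is only an illustrative sketch: the function name `retention_percent` is our own, and it divides the two reported averages (2161 initial papers, 35 final papers) rather than averaging per-survey ratios, which the text does not report.

```python
def retention_percent(initial: int, final: int) -> float:
    """Share of initially collected papers kept in the final set, in percent."""
    if initial <= 0:
        raise ValueError("initial must be positive")
    return 100.0 * final / initial


# Averages reported in the text: 2161 papers at the start, 35 in the final set.
print(f"{retention_percent(2161, 35):.1f} %")  # prints "1.6 %"
```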

Requirements for papers to be included in the survey
are usually related to the scope of the investigated research. 
Other requirements are concerned with the paper itself: avail- 
able online, written in English, a long paper, with empirical 
validation. 

After all the papers are collected, they are filtered based 
on the titles and abstracts, which are examined manually by 
the researchers (in one case, also conclusions were taken into 
account; in another case, full text of the papers was searched 
for keywords). Then the full text of each paper is read and 
the final decision is made as to whether to consider the paper 
relevant. 



TABLE VII. Existing tools used in the papers



Literature surveys are quite rare: only one out of 35
papers contains one. On the average, a literature survey
starts with a few thousand papers, which are filtered down to
a few dozen papers that will be analyzed. Usually, the
first round of filtering is based on the title and abstract;
then the full text of the papers is considered. There
is not enough information to draw conclusions about frequently
used digital libraries or conferences/journals. Half of the
surveys followed the guidelines for systematic literature
reviews by Kitchenham [22].

Literature surveys, by conference: ICSM [illegible] %, MSR [illegible] %, SCAM 0 %, WCRE 0 %, CSMR 3 %, ICPC 5 %, ESEM 13 %.



4) Comparisons: During coding, we noticed the recurring
motif of comparisons in the papers. While we assessed
neither the scope nor the goal, we have coded the basic information:
what the nature of the compared subjects is (tools,
techniques), how many subjects are compared, and whether one of
them is introduced in the paper.

We have found comparisons in 56 papers. Almost all of
them (except for 5 papers) use project-based corpora. Half of
the time, a comparison is made for the technique, approach,
or tool that was introduced in the study, with the apparent
aim of evaluating the proposed technique, approach, or tool.
On the average, such evaluation involves one other technique,
approach, or tool. In the other cases, the compared entities were
metrics, tools, algorithms, designs, etc. For such comparisons, on the
average, the group of compared entities was of size 3.



One out of three papers compares tools, techniques, 
approaches, metrics, etc. — half of the time, to evaluate 
what was introduced in the study. On the average, such 
evaluation involves one other entity. In the other half of 
the cases, the average number of compared entities is 3. 



Tool               # papers   URL
Eclipse            25         http://eclipse.org
R project          16         http://www.r-project.org/
CCFinder           6          http://www.ccfinder.net/
Understand         6          http://www.scitools.com/
Weka               6          http://www.cs.waikato.ac.nz/ml/weka/
ConQAT             4          https://www.conqat.org/
mallet             4          http://mallet.cs.umass.edu/
ChangeDistiller    3          http://www.ifi.uzh.ch/seal/research/tools/changeDistiller.htm
CodeSurfer         3          http://www.grammatech.com/products/codesurfer/overview.html
Evolizer           3          http://www.ifi.uzh.ch/seal/research/tools/evolizer.html
RapidMiner         3          http://rapid-i.com/content/view/181/190/
recoder            3          http://sourceforge.net/projects/recoder/



Comparisons, by conference: ICPC 19 %, MSR 22 %, SCAM 26 %, WCRE 30 %, ESEM 33 %, ICSM 36 %, CSMR 47 %.



D. Tools 

We have found 46 papers to introduce a tool (where we 
were able to capture this fact only if the name of the tool 
was mentioned or it was clearly stated that "a prototype" is 
implemented). In 46 more papers, we detected that additional, 
helper tooling for the current purpose of the study is imple- 
mented (parsers, analyzers, and so on). 

When the names of existing tools were explicitly mentioned as
being used, we collected the names. We have found that in 126
cases, a paper makes use of existing tools. On the average, a
paper uses 2 tools; the captured maximum is 6. The frequently
used tools are listed in Table VII. We also counted towards Eclipse
usage the cases in which a paper used an existing tool that we
know to be an Eclipse plug-in. For brevity, we omit the names of
19 tools, each of which was used in two papers.



One out of four papers introduces a new tool; another 
one out of four papers uses some home-grown tooling. 
Almost three out of four papers use existing tools. 

The most popular standard tool, Eclipse (an IDE and
a platform for plug-in development), is used in one out
of seven papers. Other popular tools cater for source code
analysis, clone detection, evolution analysis, data mining,
statistics, quality analysis, and document classification.

Home-grown tooling, by conference: ICPC 5 %, ESEM 17 %, CSMR 20 %, ICSM 33 %, MSR 33 %, WCRE 33 %, SCAM 42 %.


Introduced tools, by conference: ESEM 4 %, MSR 11 %, ICPC 19 %, SCAM 21 %, CSMR 30 %, WCRE 37 %, ICSM 44 %.



E. Structural signs of rigorousness/quality 

We do not aim to assess the quality or rigorousness of
the studies. We capture the presence of some of the aspects that
are taken into account when assessing the rigorousness/quality of
research (cf. [23]); in that, we restrict ourselves to the
structural aspects.

1) Study presentation aspects: A clear set of definitions
for the terms used in the paper is found in 25 papers.
Research questions are adopted in 83 papers. In 22 papers, a
"Goal-Question-Metric" approach is used. Explicit mention of
a null hypothesis or hypotheses is found in 23 papers. A section
"Threats to validity" is present in 111 papers; of them, 75
discuss threats using the classification described, e.g., in [24]:
threats to external (mentioned in 73 papers), internal (59
papers), construct (53 papers), and conclusion (26 papers)
validity.

If we consider combinations of these signs (definitions,
research questions, hypotheses, and threats), the most popular
one is the absence of all of them, demonstrated by 42 papers.
The second most popular combination is the presence of
research questions and threats to validity, found in 34 papers.
The third most popular is the usage of only threats to validity,
found in 29 papers. Together, these three combinations describe
60 % of the papers.
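
The coverage figure for the three combinations can be reproduced as follows. This is a minimal sketch rather than the tooling used in the survey: the representation of a combination as a `frozenset` of sign names and the helper `coverage_percent` are our own assumptions, and the total of 175 papers is taken from the confidence tallies reported in Section G (81 + 78 + 16).

```python
from collections import Counter

# Paper counts reported in the text for the three most frequent combinations.
# Each key is the set of structural signs present in a paper (assumed encoding).
top_combinations = Counter({
    frozenset(): 42,                                  # no structural signs
    frozenset({"research questions", "threats"}): 34,
    frozenset({"threats"}): 29,
})

# Total number of coded papers, from the confidence tallies in Section G.
TOTAL_PAPERS = 81 + 78 + 16


def coverage_percent(counts: Counter, total: int) -> float:
    """Percentage of all papers described by the given combinations."""
    return 100.0 * sum(counts.values()) / total


print(f"{coverage_percent(top_combinations, TOTAL_PAPERS):.0f} %")  # prints "60 %"
```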



Half of the papers use research questions to structure 
their study. One out of seven papers uses a
"Goal-Question-Metric" approach and/or formulates (null)
hypotheses to structure its research. One out of seven
papers provides an explicit set of definitions of the terms 
used in the study. Threats to validity are discussed in three 
out of five papers. 

The following three combinations of structural signs 
describe at least half of the papers in each conference, 
except for WCRE, where only 44 % of papers are covered 
by these combinations. 

No structural signs, by conference: ICPC 14 %, MSR 17 %, WCRE 19 %, ESEM 21 %, ICSM 22 %, CSMR 27 %, SCAM 53 %.


Both research questions and threats to validity, by conference: ICSM 8 %, WCRE 11 %, SCAM 16 %, CSMR 20 %, MSR 22 %, ICPC 24 %, ESEM 42 %.


Only threats to validity, by conference: ESEM 4 %, MSR 11 %, SCAM 11 %, WCRE 15 %, CSMR 17 %, ICPC 19 %, ICSM 31 %.



2) Validation: We captured mentions of validation
of the research performed. We have found evidence of some
kind of validation in 88 papers. In 50 cases, validation was 
manually performed: either the results are small enough, or a 
sufficient subset is checked. In 27 cases, validation was done 
against existing or prepared results: actual data (when evalu- 
ating predictions), data from previous work, or an oracle/gold 
standard. In 8 cases, cross-validation was used. 



F. Reproducibility 

We looked for signs of additionally provided data for
a replication of the study. Since such data is usually provided via the
Internet, we searched the papers for (the stems of) the following
keywords: "available," "download," "upload," "reproduce,"
"replicate," "host," "URL," "website," "http," "html". In this
manner, we have found links in 61 papers. In 6 cases, we could
not find any of the mentioned material, tools, or data: the links led to a
general page or to a homepage, which we searched thoroughly
but without success. In 3 more cases, we have found the replication
material on the website after some searching.
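
A keyword search of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the survey: the list `STEMS` is our own guess at stems for the listed keywords (the exact stems are not given in the text), and the names `find_keyword_hits` and `find_links` are hypothetical. Hits only flag candidate passages; each one still has to be inspected manually.

```python
import re

# Assumed stems for the keywords listed in the text (not given by the authors).
STEMS = ["avail", "download", "upload", "reproduc", "replicat",
         "host", "url", "website", "http", "html"]

# Case-insensitive substring match; short stems such as "host" also match
# unrelated words (e.g. "ghost"), so hits need a manual check.
STEM_PATTERN = re.compile("|".join(map(re.escape, STEMS)), re.IGNORECASE)
LINK_PATTERN = re.compile(r"https?://[^\s)\]>\"']+", re.IGNORECASE)


def find_keyword_hits(text: str) -> list[str]:
    """Return the sorted, de-duplicated stems that occur in the text."""
    return sorted({m.group(0).lower() for m in STEM_PATTERN.finditer(text)})


def find_links(text: str) -> list[str]:
    """Return all http(s) links, with trailing punctuation stripped."""
    return [link.rstrip(".,;") for link in LINK_PATTERN.findall(text)]


sample = "The tool is available for download at http://example.org/tool."
print(find_keyword_hits(sample))
print(find_links(sample))
```

The sample sentence and the `example.org` address are placeholders; in practice the functions would be applied to the extracted full text of each paper.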



One out of three papers additionally provides some data
from the study online, though the data cannot always be found.

Additional data provided, by conference: SCAM 26 %, ICSM 31 %, CSMR 33 %, MSR 33 %, WCRE 33 %, ESEM 38 %, ICPC 48 %.



As to the nature of the provided data, in 25 cases, an
introduced tool or tooling used in the research is provided. In
15 cases, the used corpus is provided, in full or partially;
the complete description of the corpus (list of used projects
with their versions and/or links) is provided by 6 papers. Raw
data is available for 14 papers; the same number of papers
provide final and/or additional results of the study.

When the corpus is not provided by the paper, but the 
names of the used projects are mentioned, the main aspect of 
being able to reproduce the corpus is knowing which versions 
of the projects were used. We noticed that in 21 papers versions 
of the used projects are not provided. In 67 papers, versions 
of the projects are mentioned explicitly; in 26 more cases, it 
is possible to reconstruct the version from the mentioned time 
periods that the study spans. 

Altogether, we judged 29 papers to be reproducible, mean- 
ing that either all components were provided by the authors or 
we concluded that the paper contains enough details to collect 
exactly the same corpus and the same tools. We did not judge 
if it is possible to follow the provided instructions, specific to 
the reported research. 

We would also like to note that 8 papers mention that they
perform a replication in their study; 3 of them are
self-replications.



We judged one out of six papers to be reproducible 
with respect to the used corpus and tools. We did not 
assess whether enough details were provided to re-conduct 
the research itself. 

Judged to be reproducible, by conference: ICSM 3 %, WCRE 4 %, SCAM 16 %, ICPC 19 %, ESEM 25 %, CSMR 27 %, MSR 33 %.



G. Assessment 

Though the information we extracted from the papers
was usually scattered across different sections, half of the papers had
tables (listing projects, their names, versions, used releases,
and similar information) that helped us during coding. We
captured our confidence in the coded profile of each paper.
For that, we used the following scale: high, moderate, and
low level of confidence. The results are as follows: high, 81
papers; moderate, 78 papers; low, 16 papers.



We have low confidence in one out of eleven papers 
that we have coded. In the rest, half of the time we 
are moderately confident and half of the time — highly 
confident in the results. 

High confidence, by conference: WCRE 15 %, SCAM 26 %, ICSM 42 %, ICPC 52 %, ESEM 54 %, CSMR 60 %, MSR 78 %.


Moderate confidence, by conference: MSR 17 %, CSMR 33 %, ESEM 33 %, ICPC 43 %, ICSM 53 %, SCAM 58 %, WCRE 67 %.


Low confidence, by conference: ICPC 0 %, ICSM 6 %, MSR 6 %, CSMR 7 %, ESEM 13 %, SCAM 16 %, WCRE 19 %.



IV. Related work 

We summarized related work (in the sense of other literature
surveys on SE research) in the introduction and Table [].
Thus, the key differences between our survey and previous 
work are these: i) predefined schema in previous work versus 
emerged schema in the present survey; ii) focus on specific 
forms or characteristics of SE research in previous work versus 
broad analysis of empirical evidence in the present survey. 
Below we compare the findings of related work where they 
overlap with ours. 

As a general remark, we believe that our findings quantita- 
tively differ from previous findings because of several factors: 
i) the dependence on the choice of venues: even conferences 
in our study differ considerably; ii) passed time: there is at 
least a five-year gap, during which popularity of empirical 
research and of its particular forms might have grown; iii) 
the cited papers use mostly journals: this may increase the 
aforementioned gap because of the longer process for journal 
publications; iv) snapshot versus longitudinal approach: we 
take into account all papers of the latest proceedings while 
the cited papers focus on a sample across several years. 

The closest work to ours is by Zannier et al. [7]: they
measured quantity and quality of empirical evaluation in ICSE 
papers over the years. Our work provides a snapshot study 
aiming to represent SE research broadly across conferences. 
Zannier et al., when assigning types to the papers, could
confirm the self-classification of half of the studies, which
agrees with our observation that self-classification is rather
weak among SE papers. They also observe the extremely
low usage of hypotheses (only one paper) and absence of 
replications. We do find some adoption of null hypotheses and 
replications. 

According to their classification, Glass et al. [5] have
found 1.1 % of papers to contain literature reviews and 3 % of papers
to present "laboratory experiment (human subjects)." We also
find that the number of literature surveys and experiments is
low, but in relative terms it has increased 2-3 times.

Kitchenham et al. [8] considered only 0.75 % of surveyed
papers to be systematic literature reviews. We have found
literature surveys in 5 papers, one of which did not contain a
clear methodology (a requirement of Kitchenham's
inclusion criteria), leaving 4 papers. Thus, our percentage of
detected literature surveys is 2.3 %.

According to Sjøberg et al.'s study [6], only 2 % of the
papers contain experiments, while we find 10 % of surveyed
papers to contain an experiment. On the average, Sjøberg
et al. detected an experiment to involve 30 participants:
in 72.6 % of cases only students, in 18.6 % of cases only professionals,
and in 8 % of cases mixed groups. We have found
that on the average an experiment involves 16 participants:
in 57 % of cases only students, in 14 % of cases only professionals,
and in 29 % of cases mixed groups.



V. Threats to validity 

1) Choice of the papers: We did not use journal articles:
while they might provide more information or be of higher
quality, we wanted to capture the state of the common research,
of which we believe conference proceedings to be more
representative. We have chosen conferences with proceedings
of similar and reasonable size, so as not to skew the general
results by one larger conference and so as to include all
the papers but still be able to process them within a reasonable
period of time. Specifically, we excluded the ICSE conference,
which had 87 long papers in the proceedings of the 2012 edition.
Altogether, this means our results might not be generalizable,
but we believe them to be representative enough.

2) Choice of the period: Since we perform a snapshot 
study, it might be that some of the discovered numbers are 
a coincidental spike. A longitudinal study (possible future
work) may provide more details and deeper understanding.

3) Coding: The effort was manual with occasional search 
by specific keywords (mentioned in the appropriate subsections 
of Section III). In 5 cases, papers were OCR-scanned.

Human factor. Coding was done by one researcher, but the 
results of the first pass were cross-validated during the second 
pass as well as during the aggregation phase. When in doubt, 
the researcher constantly referred back to the surveyed papers 
to double-check. 

Scheme. We do not claim our coding scheme to be 
complete or advanced. We captured basic data related to the 
used empirical evidence, often either obvious or structurally 
supported. Therefore, we might miss sophisticated or under- 
specified forms of empirical research. 



VI. Conclusion 

In this paper, we presented a literature survey on empirical 
evidence in Software Engineering research. 



Answers to the research questions: Coming back to the 
initial questions that motivated our research (see Section I),
we suggest the following answers: 

I The overwhelming majority of surveyed conference pa- 
pers use corpora — collections of empirical evidence. 
II The majority of the corpora consist of projects and can 
be characterized by size, code form, software language, 
evolution measures, requirements, and applied tunings. 
III There are no frequently used projects or corpora across
all the papers. We have, though, detected some pattern of
project recurrence with low frequency.

In what follows, we further interpret these findings. 

No "holy grail": Though corpora are used in the majority
of the surveyed papers and some clusters of characteristics of
the used corpora are recurrent (e.g., the use of many open
source Java projects), the usage of established datasets is low.
We suggest two possible reasons. First, adoption may be low
only for now: among the detected datasets being used (see Table IV),
the oldest dataset, SIR, was introduced in 2005, and the youngest,
Qualitas, in 2010. Second, researchers may prefer to collect
and prepare their corpora themselves, because there might not
be a "holy grail" among corpora to suit all possible needs.
This assumption is partially supported by the fact that, even
on the level of projects, no clear favorite was detected among
the papers. The emerged schema, with its components for
requirements and tunings for corpora, indeed also substantiates
the different needs of research efforts.

Community-specific curated collections: On the other
hand, we find that three out of seven conferences have favorite
projects when considered separately: projects that are used by
a quarter of the papers within these conferences. This leads to
a refined version of the third question in our study: When it is 
possible to detect commonly used projects within a conference, 
would it be useful to provide a curated version of them? 
Generally, it is clear that even requirements and tunings are 
recurrent across research efforts, and hence, some "product 
line" of curated collections and some discipline of "corpus 
engineering" may ultimately lead to more reuse of empirical 
evidence. These are topics for future work. 

Top-down vs. bottom-up introduction of methodology: 
While there is a need for adoption of advanced and theoret- 
ically specified forms of empirical research, we believe that 
there is a certain amount of de facto empirical research in 
Software Engineering that has formed historically. This survey 
sought to understand the characteristics of empirical evidence 
in research — also to enable assessment, if not improvement, 
of research quality. Future work includes aligning research 
areas or goals with the kind of used empirical evidence: 
deeper understanding of the needs may provide insights for 
streamlining research. 